In [1]:
# Exploratory Data Analysis
# What is Exploratory Data Analysis (EDA)?
# EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. It is an essential step in the data analysis process, allowing you to understand the data's structure, detect outliers, and identify patterns.
# EDA is typically performed before any formal modeling or hypothesis testing.
# End goal of EDA:
# - To gain insights into the data
# - To prepare the data for further analysis or modeling
# Note: In a typical Machine Learning project, EDA commonly takes up a large share of the total time (often around 70%).
# If the EDA phase is not taken seriously, the rest of the project will be flawed.
# In fact, EDA is the most important step in the data analysis process. You may have to perform EDA again after the modeling phase to understand the model's performance and behavior.
# Therefore, EDA is an iterative process that may require revisiting as new insights are gained or as the data evolves.
# Step-by-step EDA process in the industry:
# 1. Data Collection: Gather data from various sources, such as databases, APIs, or files.
# 2. Data Cleaning: Handle missing values, remove duplicates, and correct inconsistencies in the data.
# 3. Data Exploration: Use descriptive statistics and visualizations to understand the data's distribution, relationships, and patterns.
# 4. Feature Engineering: Create new features or modify existing ones to improve the data's predictive power.
# 5. Data Transformation: Normalize or scale the data, if necessary, to prepare it for modeling.
# 6. Data Visualization: Create visual representations of the data to communicate findings and insights effectively.
# 7. Documentation: Document the EDA process, findings, and any decisions made for future reference.
# Real-time example with respect to Amazon dataset - What happens at every step of EDA as per the above steps:
# 1. Data Collection: Download the Amazon dataset from a public repository or API.
# 2. Data Cleaning: Check for missing values in the dataset like product reviews, remove duplicates like multiple reviews for the same product, and correct inconsistencies like different formats for dates.
# 3. Data Exploration: Use descriptive statistics to summarize the dataset, such as the average rating of products, the distribution of review lengths, and the number of reviews per product.
# 4. Feature Engineering: Create new features like the sentiment score of reviews, the length of the review text, or the time since the last review.
# 5. Data Transformation: Normalize the sentiment scores or scale the review lengths to prepare them for modeling.
# 6. Data Visualization: Create visualizations like histograms of review lengths, bar charts of average ratings per product category, or scatter plots of sentiment scores against review lengths.
# 7. Documentation: Document the EDA process, findings, and any decisions made, such as which features to include in the modeling phase or any patterns observed in the data.
# EDA is a crucial step in the data analysis process, and it is essential to take it seriously to ensure the success of any data-driven project.
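As a quick sketch of how the cleaning and exploration steps above look in code, here is a minimal helper; the `basic_eda` function and the tiny `sample` DataFrame are illustrative assumptions, not part of any real pipeline:

```python
import pandas as pd

def basic_eda(df: pd.DataFrame) -> dict:
    """A minimal EDA summary covering cleaning checks and exploration stats."""
    return {
        "shape": df.shape,                           # rows x columns
        "missing": df.isna().sum().to_dict(),        # step 2: cleaning targets
        "duplicates": int(df.duplicated().sum()),    # step 2: duplicate rows
        "numeric_summary": df.describe().to_dict(),  # step 3: descriptive stats
    }

# Tiny hypothetical dataset just to exercise the helper
sample = pd.DataFrame({"rating": [5, 4, None, 3], "product": ["a", "a", "b", "b"]})
summary = basic_eda(sample)
```

In practice each step would be a much richer pass over the data; this only shows where the checks plug in.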
In [2]:
# Let's do the necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
In [3]:
# Our Problem Statement - Titanic Dataset EDA
# Load the dataset from a CSV file
# https://raw.githubusercontent.com/ingledarshan/upGrad_Darshan/refs/heads/main/titanic.csv
df = pd.read_csv('https://raw.githubusercontent.com/ingledarshan/upGrad_Darshan/refs/heads/main/titanic.csv')
In [4]:
# Let's check the first 5 rows of the dataset
print("First 5 rows of the dataset:")
df.head()
First 5 rows of the dataset:
Out[4]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [5]:
df.head()
Out[5]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [6]:
# Understanding the meaning of the columns in the dataset:
# PassengerId: Unique identifier for each passenger eg: 1, 2, 3, ...
# Survived: Survival status (0 = No, 1 = Yes)
# Pclass: Passenger class (1 = First class, 2 = Second class, 3 = Third class)
# Name: Name of the passenger eg: "Braund, Mr. Owen Harris", etc.
# Sex: Gender of the passenger (male, female)
# Age: Age of the passenger in years (float)
# SibSp: Number of siblings and spouses aboard the Titanic (integer)
# Parch: Number of parents and children aboard the Titanic (integer)
# Ticket: Ticket number of the passenger (string)
# Fare: Fare paid by the passenger (float) in British pounds
# Cabin: Cabin number of the passenger (string, may contain NaN values)
# Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
In [7]:
# What is the objective of this EDA?
# The objective of this Exploratory Data Analysis (EDA) is to gain insights into the Titanic dataset, understand the factors that may have influenced passenger survival, and identify any patterns or trends in the data. This analysis will help in building predictive models and making data-driven decisions.
# The EDA will include:
# 1. Data Cleaning: Handling missing values, duplicates, and inconsistencies.
# 2. Data Exploration: Analyzing the distribution of key features, such as age, fare, and survival rates.
# 3. Feature Engineering: Creating new features or modifying existing ones to improve the dataset's predictive power.
# 4. Data Visualization: Creating visual representations of the data to communicate findings effectively.
# 5. Documentation: Documenting the EDA process, findings, and any decisions made for future reference.
In [8]:
df.head()
Out[8]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [9]:
# Let's formulate some hypotheses that we can answer using this dataset:
# 1. What is the overall survival rate of passengers on the Titanic?
# 2. How does survival rate vary by passenger class (Pclass)?
# 3. Is there a significant difference in survival rates between male and female passengers?
# 4. How does age affect the survival rate of passengers?
# 5. What is the distribution of fares paid by passengers, and how does it relate to survival?
# 6. Are there any patterns in the number of siblings/spouses (SibSp) and parents/children (Parch) aboard the Titanic?
# 7. How does the port of embarkation (Embarked) affect survival rates?
# 8. Are there any correlations between the features in the dataset, such as age, fare, and survival?
# 9. What is the distribution of cabin numbers, and how does it relate to survival?
# 10. Are there any notable trends or patterns in the names of passengers that could provide insights into their backgrounds or social status?
# Now, let's start with the EDA process by checking the shape of the dataset
print("Shape of the dataset:", df.shape)
Shape of the dataset: (891, 12)
In [10]:
# Insights from the shape of the dataset:
# The dataset contains 891 rows and 12 columns, indicating that there are 891 passengers with 12 features each.
# This means we have a good amount of data to analyze, which can help us draw meaningful conclusions about the passengers and their survival on the Titanic.
In [11]:
# Let's check the data types of each column to understand the structure of the dataset
print("Data types of each column:")
print(df.dtypes)
Data types of each column:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [12]:
# Insights from the data types:
# The dataset contains a mix of numerical and categorical features.
# Numerical features include: Age, Fare, SibSp, Parch
# Categorical features include: Survived, Pclass, Sex, Embarked, Cabin, Name
# Understanding the data types will help us determine the appropriate analysis techniques and visualizations to use.
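A common way to separate the numerical and categorical (object) columns programmatically is pandas' `select_dtypes`; here is a minimal sketch on a hypothetical three-column frame:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Age": [22.0, 38.0],        # float -> numerical
    "SibSp": [1, 0],            # int -> numerical
    "Sex": ["male", "female"],  # object -> categorical
})

# Split columns by dtype so each group can get the right analysis/visualization
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
object_cols = df_demo.select_dtypes(include="object").columns.tolist()
```

On the Titanic DataFrame the same two calls would split the twelve columns the same way.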
In [13]:
# Let's check the information about the dataset
print("Information about the dataset:")
df.info()
Information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [14]:
# Insights from the data types:
# The dataset contains a mix of data types:
# - Integer: PassengerId, Survived, Pclass, SibSp, Parch
# - Float: Age, Fare
# - Object: Name, Sex, Ticket, Cabin, Embarked
In [15]:
df.head()
Out[15]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [16]:
# Let's look at the summary statistics of the dataset
print("Summary statistics of the dataset:")
df.describe()
Summary statistics of the dataset:
Out[16]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
In [17]:
# Let's check for missing values in the dataset
print("Missing values in each column:")
# df.isnull().sum().sort_values(ascending=False)
df.isna().sum().sort_values(ascending=False)
Missing values in each column:
Out[17]:
Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64
In [18]:
# Insights from the missing values:
# The dataset has missing values in the following columns:
# - Age: 177 missing values (19.9% of the total rows)
# - Cabin: 687 missing values (77.1% of the total rows)
# - Embarked: 2 missing values (0.2% of the total rows)
# Handling missing values is crucial for accurate analysis and modeling.
| Function | Purpose | Equivalent? | Preferred? |
|---|---|---|---|
| `isnull()` | Detects NaN/missing | ✅ Yes | Use what you prefer |
| `isna()` | Detects NaN/missing | ✅ Yes | Slightly more "Pythonic" |

So go with whichever you find more readable or memorable — they are two names for the same function.
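A quick sanity check that the two names really are aliases (a minimal sketch on a throwaway Series):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# isnull() and isna() produce identical boolean masks
same = s.isnull().equals(s.isna())
```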
In [19]:
!pip install missingno -q
# missingno is a Python library for visualizing missing data
import missingno as msno
# Visualizing missing values using missingno
msno.matrix(df)
Out[19]:
<Axes: >
In [20]:
msno.bar(df)
Out[20]:
<Axes: >
In [21]:
!pip install sweetviz -q
# sweetviz is a Python library for visualizing and analyzing datasets
# It generates a report that provides insights into the dataset, including missing values, feature distributions, and correlations.
import sweetviz as sv
# Generating a report using sweetviz
report = sv.analyze(df)
# Displaying the report
report.show_html('titanic_preprofiling_report_sv.html')
# The report will be saved as an HTML file named 'titanic_preprofiling_report_sv.html'
# The sweetviz report provides a comprehensive overview of the dataset, including:
# - Feature distributions
# - Missing values
# - Correlations between features
# - Sample data
# This report can help us understand the dataset better and identify any potential issues or patterns that need to be addressed during the EDA process.
Report titanic_preprofiling_report_sv.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
In [22]:
# Pandas Profiling is another library that can be used to generate a report for the dataset
# https://github.com/ydataai/ydata-profiling/blob/develop/examples/bank_marketing_data/banking_data.py
!pip install ydata-profiling -q
from pathlib import Path # pathlib is a module in Python that provides an object-oriented interface for working with file system paths. Path is a class in the pathlib module that represents a file system path.
from ydata_profiling import ProfileReport
profile = ProfileReport(
df, title="Profile Report of the Titanic Dataset - Pre-Profiling", explorative=True
)
profile.to_file(Path("titanic_preprofiling_report_pp.html"))
In [23]:
df.columns # Display the columns in the DataFrame
Out[23]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
In [24]:
cat_cols = ['Survived', 'Pclass', 'Sex', 'SibSp','Parch', 'Embarked']
# Let's check the unique values in each categorical column
for col in cat_cols:
print(f"Unique values in {col}: {df[col].unique()}")
print("="*143)
# Let's check the value counts of each categorical column
for col in cat_cols:
print(f"Value counts for {col}:")
print(df[col].value_counts())
print("."*143)
print(df[col].value_counts(normalize=True)*100) # Normalized value counts
print("."*143)
df[col].value_counts().plot(kind='bar', title=col)
plt.xticks(rotation=45)
plt.xlabel(col)
plt.ylabel('Count')
# Plot text on top of the bars
for index, value in enumerate(df[col].value_counts()):
plt.text(index, value + 5, str(value), ha='center', va='bottom')
plt.show()
# Insights from the value counts:
# - Most passengers did not survive (0 = No, 1 = Yes), with a survival rate of approximately 38.4%.
# - The majority of passengers were in third class (Pclass = 3), followed by first class (Pclass = 1) and second class (Pclass = 2).
# - Majority of passengers were male (Sex = male), with a smaller proportion of female passengers (Sex = female).
# - Most passengers had no siblings or spouses aboard (SibSp = 0), followed by those with one sibling or spouse (SibSp = 1).
# - Most passengers had no parents or children aboard (Parch = 0), followed by those with one parent or child (Parch = 1).
# - The majority of passengers embarked from Southampton (Embarked = S), followed by Cherbourg (Embarked = C) and Queenstown (Embarked = Q).
Unique values in Survived: [0 1]
Unique values in Pclass: [3 1 2]
Unique values in Sex: ['male' 'female']
Unique values in SibSp: [1 0 3 4 2 5 8]
Unique values in Parch: [0 1 2 5 3 4 6]
Unique values in Embarked: ['S' 'C' 'Q' nan]
===========================================================================
Value counts for Survived:
Survived
0    549
1    342
Name: count, dtype: int64
...........................................................................
Survived
0    61.616162
1    38.383838
Name: proportion, dtype: float64
...........................................................................
Value counts for Pclass:
Pclass
3    491
1    216
2    184
Name: count, dtype: int64
...........................................................................
Pclass
3    55.106622
1    24.242424
2    20.650954
Name: proportion, dtype: float64
...........................................................................
Value counts for Sex:
Sex
male      577
female    314
Name: count, dtype: int64
...........................................................................
Sex
male      64.758698
female    35.241302
Name: proportion, dtype: float64
...........................................................................
Value counts for SibSp:
SibSp
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: count, dtype: int64
...........................................................................
SibSp
0    68.237935
1    23.456790
2     3.142536
4     2.020202
3     1.795735
8     0.785634
5     0.561167
Name: proportion, dtype: float64
...........................................................................
Value counts for Parch:
Parch
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: count, dtype: int64
...........................................................................
Parch
0    76.094276
1    13.243547
2     8.978676
5     0.561167
3     0.561167
4     0.448934
6     0.112233
Name: proportion, dtype: float64
...........................................................................
Value counts for Embarked:
Embarked
S    644
C    168
Q     77
Name: count, dtype: int64
...........................................................................
Embarked
S    72.440945
C    18.897638
Q     8.661417
Name: proportion, dtype: float64
...........................................................................
In [25]:
# Find the missing values in each column in the dataset
df.isnull().sum().sort_values(ascending=False)
Out[25]:
Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64
In [26]:
len(df)
Out[26]:
891
In [27]:
# Find the percentage of missing values in each column in the dataset
missing_percentage = df.isnull().sum() / len(df) * 100
print("Percentage of missing values in each column:")
print(missing_percentage.sort_values(ascending=False))
Percentage of missing values in each column:
Cabin          77.104377
Age            19.865320
Embarked        0.224467
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
dtype: float64
In [28]:
# Insights from the missing values percentage:
# - Age has 19.9% missing values, which is significant and may require imputation or removal.
# - Cabin has 77.1% missing values, which is very high and may not be useful for analysis.
# - Embarked has only 0.2% missing values, which is negligible and can be easily handled.
In [29]:
# Drop the Cabin column as it has too many missing values
# df.drop(columns=['Cabin'], inplace=True)
# or
df.drop(['Cabin'], inplace=True, axis=1)
# PassengerId is not useful for analysis either; it could be dropped as well, but we keep it here as a row identifier.
In [30]:
# Find the percentage of missing values in each column in the dataset
missing_percentage = df.isnull().sum() / len(df) * 100
print("Percentage of missing values in each column:")
print(missing_percentage.sort_values(ascending=False))
Percentage of missing values in each column:
Age            19.865320
Embarked        0.224467
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
dtype: float64
In [31]:
# Embarked has only 0.2% missing values, which is negligible and can be easily handled.
# Let's understand the various ways in which missing values in a categorical column can be handled:
# 1. Drop the rows with missing values in the categorical column
# 2. Fill the missing values with the mode of the categorical column
# 3. Fill the missing values with a placeholder value (e.g., 'Unknown', 'Not Available')
# 4. Use forward fill or backward fill to fill the missing values based on the previous or next value in the column https://miro.medium.com/v2/resize:fit:1348/0*BXk8ZdgFBXNw2yuZ
# 5. Use machine learning algorithms to predict the missing values based on other features in the dataset (advanced technique)
In [32]:
# forward fill or backward fill demonstration
# Creating the dummy student dataset
data = {
'Student_ID': [101, 102, 103, 104, 105, 106],
'Name': ['Alice', 'Bob', np.nan, 'David', np.nan, 'Frank'],
'Grade': ['A', np.nan, 'B', np.nan, 'C', np.nan],
'Marks': [95, np.nan, 88, np.nan, np.nan, 72]
}
demo = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
demo
Original DataFrame with Missing Values:
Out[32]:
| | Student_ID | Name | Grade | Marks |
|---|---|---|---|---|
| 0 | 101 | Alice | A | 95.0 |
| 1 | 102 | Bob | NaN | NaN |
| 2 | 103 | NaN | B | 88.0 |
| 3 | 104 | David | NaN | NaN |
| 4 | 105 | NaN | C | NaN |
| 5 | 106 | Frank | NaN | 72.0 |
In [33]:
# Forward fill (ffill): fill missing values in the Name column with the previous non-missing value
demo['Name_ffill'] = demo['Name'].ffill()
# Backward fill (bfill): fill missing values in the Name column with the next non-missing value
demo['Name_bfill'] = demo['Name'].bfill()
demo
Out[33]:
| | Student_ID | Name | Grade | Marks | Name_ffill | Name_bfill |
|---|---|---|---|---|---|---|
| 0 | 101 | Alice | A | 95.0 | Alice | Alice |
| 1 | 102 | Bob | NaN | NaN | Bob | Bob |
| 2 | 103 | NaN | B | 88.0 | Bob | David |
| 3 | 104 | David | NaN | NaN | David | David |
| 4 | 105 | NaN | C | NaN | David | Frank |
| 5 | 106 | Frank | NaN | 72.0 | Frank | Frank |
In [34]:
df.Embarked.mode() # Find the mode of the Embarked column
Out[34]:
0    S
Name: Embarked, dtype: object
In [35]:
type(df.Embarked.mode())
Out[35]:
pandas.core.series.Series
In [36]:
df.Embarked.mode()[0] # Get the mode of the Embarked column
Out[36]:
'S'
In [37]:
# Let's fill the missing values in the Embarked column with the mode of the column
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
In [38]:
# Find the percentage of missing values in each column in the dataset
missing_percentage = df.isnull().sum() / len(df) * 100
print("Percentage of missing values in each column:")
print(missing_percentage.sort_values(ascending=False))
Percentage of missing values in each column:
Age            19.86532
PassengerId     0.00000
Survived        0.00000
Pclass          0.00000
Name            0.00000
Sex             0.00000
SibSp           0.00000
Parch           0.00000
Ticket          0.00000
Fare            0.00000
Embarked        0.00000
dtype: float64
In [39]:
# Let's fill the missing values in the Age column
# Age column is numerical in nature, so we can fill the missing values with either the mean or median of the column
# Rules of thumb:
# 1. If the data is normally distributed, use the mean to fill missing values.
# 2. If the data is skewed, use the median to fill missing values.
# Let's check the distribution of the Age column
sns.histplot(df['Age'], kde=True)
plt.axvline(df['Age'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(df['Age'].median(), color='green', linestyle='dotted', linewidth=1, label='Median')
plt.title('Distribution of Age with Mean and Median')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend()
plt.show()
# Let's calculate the skewness of the Age column
print("Skewness of the Age column:", df['Age'].skew())
# If this value is between:
# a) -0.5 and 0.5, the distribution of the value is almost symmetrical
# b) -1 and -0.5, the data is negatively skewed, and if it is between 0.5 to 1, the data is positively skewed. The skewness is moderate.
# c) If the skewness is lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data is highly skewed.
Skewness of the Age column: 0.38910778230082704
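The mean-vs-median rule of thumb above can be wrapped in a small helper; this is a sketch, with the 0.5 symmetry threshold taken from the comments above and a hypothetical symmetric Series as input:

```python
import pandas as pd

def impute_by_skew(series: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Fill missing values with the mean if roughly symmetric, else the median."""
    if abs(series.skew()) < threshold:
        return series.fillna(series.mean())  # near-symmetric: mean is safe
    return series.fillna(series.median())    # skewed: median is robust

# [1, 2, 3] is perfectly symmetric, so the mean (2.0) is used for the NaN
sym = pd.Series([1.0, 2.0, 3.0, None])
filled = impute_by_skew(sym)
```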
In [40]:
# Credit: Shivamsuman
# Insights from the distribution plot:
# Looking at the histogram near zero, there do not appear to be many passengers with Age close to 0.
In [41]:
# Since the skewness of the Age column is 0.389, which is between -0.5 and 0.5, the distribution is almost symmetrical.
# Therefore, we can use the mean to fill the missing values in the Age column
df['Age'] = df['Age'].fillna(df['Age'].mean())
In [42]:
# Let's check the distribution of the Age column
sns.histplot(df['Age'], kde=True)
plt.axvline(df['Age'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(df['Age'].median(), color='green', linestyle='dotted', linewidth=1, label='Median')
plt.title('Distribution of Age with Mean and Median')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend()
plt.show()
# Let's calculate the skewness of the Age column
print("Skewness of the Age column:", df['Age'].skew())
# If this value is between:
# a) -0.5 and 0.5, the distribution of the value is almost symmetrical
# b) -1 and -0.5, the data is negatively skewed, and if it is between 0.5 to 1, the data is positively skewed. The skewness is moderate.
# c) If the skewness is lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data is highly skewed.
Skewness of the Age column: 0.4344880940129925
In [43]:
# Find the percentage of missing values in each column in the dataset
missing_percentage = df.isnull().sum() / len(df) * 100
print("Percentage of missing values in each column:")
print(missing_percentage.sort_values(ascending=False))
Percentage of missing values in each column:
PassengerId    0.0
Survived       0.0
Pclass         0.0
Name           0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Ticket         0.0
Fare           0.0
Embarked       0.0
dtype: float64
In [44]:
# Now, the dataset has no missing values
# Let's check the shape of the dataset again
print("Shape of the dataset after handling missing values:", df.shape)
Shape of the dataset after handling missing values: (891, 11)
In [45]:
# Let us check for duplicates in the dataset
print("Number of duplicate rows in the dataset:", df.duplicated().sum())
# Insights from the duplicate rows:
# There are no duplicate rows in the dataset, which is good as it ensures that each passenger is unique and there are no repeated entries.
Number of duplicate rows in the dataset: 0
In [46]:
# Now that the dataset is clean, we can proceed with the Feature Engineering step of the EDA process.
# What is Feature Engineering?
# Feature - Column, Engineering - Creating
# Feature Engineering is the process of creating new, more meaningful features from the existing features in the dataset.
# It uses domain knowledge to extract variables that make machine learning algorithms work better, either by creating new features or by modifying existing ones to improve the dataset's predictive power.
# Feature Engineering can include:
# 1. Creating new features from existing ones (e.g., extracting titles from names, creating age groups)
# 2. Encoding categorical variables (e.g., converting 'Sex' and 'Embarked' columns into numerical format)
# 3. Normalizing or scaling numerical features (e.g., scaling 'Fare' and 'Age' columns)
# 4. Handling outliers (e.g., removing or transforming extreme values)
# 5. Creating interaction features (e.g., combining 'Pclass' and 'Sex' to create a new feature)
# General examples of Feature Engineering:
# 1. Calculating the Body Mass Index (BMI) from weight and height features in a health dataset.
# 2. Extracting the year, month, and day from a date feature in a time series dataset.
# 3. If a dataset contains a timestamp, creating new features like hour, day of the week, or month to capture temporal patterns.
# 4. If a dataset has a Date of Birth column, a new Age column can be created by subtracting the Date of Birth from the current date.
# 5. If a dataset has columns like basic salary, dearness allowance, and house rent allowance, a new Gross Salary column can be created by adding them together.
df.head()
Out[46]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S |
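The general examples above (BMI from weight and height, date parts from a date column) can be sketched on a small hypothetical health DataFrame; the column names are illustrative:

```python
import pandas as pd

health = pd.DataFrame({
    "weight_kg": [70.0, 81.0],
    "height_m": [1.75, 1.80],
    "visit_date": pd.to_datetime(["2023-01-15", "2023-06-02"]),
})

# Example 1: BMI = weight / height^2
health["bmi"] = health["weight_kg"] / health["height_m"] ** 2

# Examples 2 & 3: extract date parts to capture temporal patterns
health["visit_year"] = health["visit_date"].dt.year
health["visit_month"] = health["visit_date"].dt.month
```

The Titanic features engineered below (FamilySize, isAlone) follow the same pattern: combine or transform existing columns into more informative ones.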
In [47]:
# Let's engineer some features in the Titanic dataset
# 1. FamilySize = SibSp + Parch + 1 (adding 1 to include the passenger themselves)
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df.head()
Out[47]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 |
In [48]:
# 2. isAlone = 1 if FamilySize == 1 else 0
# df['isAlone'] = np.where(df['FamilySize'] == 1, 1, 0)
# or
df['isAlone'] = df['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
df.head()
Out[48]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 |
In [49]:
df["isAlone"].value_counts().plot(kind='bar', title='isAlone')
plt.xticks(rotation=0)
plt.xlabel('isAlone')
plt.ylabel('Count')
# Plot text on top of the bars
for index, value in enumerate(df["isAlone"].value_counts()):
    plt.text(index, value, str(value), ha='center', va='bottom')
plt.show()
# Insights from the isAlone feature:
# - Most passengers were alone (isAlone = 1), with a count of 537.
# - A smaller proportion of passengers traveled with family (isAlone = 0), with a count of 354.
In [50]:
df.head()
Out[50]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 |
In [51]:
print("Futrelle, Mrs. Jacques Heath (Lily May Peel)".split(","))
print("Futrelle, Mrs. Jacques Heath (Lily May Peel)".split(",")[1])
print("Futrelle, Mrs. Jacques Heath (Lily May Peel)".split(",")[1].split("."))
print("Futrelle, Mrs. Jacques Heath (Lily May Peel)".split(",")[1].split(".")[0])
print("=" * 143)
print("Futrelle, Mrs. Jacques Heath (Lily May Peel)".split(",")[1].split(".")[0].strip())
print("Braund, Mr. Owen Harris".split(",")[1].split(".")[0].strip())
print("Heikkinen, Miss. Laina".split(",")[1].split(".")[0].strip())
print("Allen, Mr. William Henry".split(",")[1].split(".")[0].strip())
['Futrelle', ' Mrs. Jacques Heath (Lily May Peel)']
 Mrs. Jacques Heath (Lily May Peel)
[' Mrs', ' Jacques Heath (Lily May Peel)']
 Mrs
===============================================================================================================================================
Mrs
Mr
Miss
Mr
In [52]:
# 3. Title extraction from Name column
# Way1 - User-defined function to extract title from Name column
# def extract_title(name):
#     # Split the name by comma and then by dot to get the title
#     return name.split(",")[1].split(".")[0].strip()
# # Apply the function to the Name column to create a new Title column
# df['Title'] = df['Name'].apply(extract_title)
# or
# Way2 - lambda function
df['Title'] = df['Name'].apply(lambda x: x.split(",")[1].split(".")[0].strip())
df.head()
Out[52]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
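As a cross-check, the same split-and-strip logic can be written as a single vectorized regex using pandas string methods. A minimal sketch on hypothetical sample names (same pattern as the Name column, not the full dataset):

```python
import pandas as pd

# Hypothetical sample names in the "Surname, Title. Given names" pattern
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Heikkinen, Miss. Laina",
])
# Capture the text between the comma and the first dot, then strip whitespace
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

On the real frame this would be `df['Name'].str.extract(...)`, which avoids a Python-level function call per row.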
In [53]:
df.head()
Out[53]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
In [54]:
# Let's analyze the Fare column
df["Fare"].describe()
# Insights from the Fare column:
# - The average fare paid by passengers is approximately 32.20 British pounds.
# - The minimum fare is 0.00 British pounds, which indicates that some passengers may have traveled for free or at a very low cost.
# - The maximum fare is 512.33 British pounds, which indicates that some passengers paid significantly higher fares.
# - The standard deviation of the fare is approximately 49.69, indicating a wide range of fares paid by passengers.
# Visualizing the distribution of the Fare column
sns.histplot(df['Fare'], kde=True)
plt.axvline(df['Fare'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(df['Fare'].median(), color='green', linestyle='dotted', linewidth=1, label='Median')
plt.axvline(df['Fare'].quantile(0.75), color='blue', linestyle='dotted', linewidth=1, label='75th Percentile')
plt.axvline(df['Fare'].quantile(0.85), color='purple', linestyle='dotted', linewidth=1, label='85th Percentile')
plt.axvline(df['Fare'].quantile(0.90), color='olive', linestyle='dotted', linewidth=1, label='90th Percentile')
plt.axvline(df['Fare'].quantile(0.95), color='orange', linestyle='dotted', linewidth=1, label='95th Percentile')
plt.axvline(df['Fare'].quantile(0.99), color='brown', linestyle='dotted', linewidth=1, label='99th Percentile')
plt.axvline(df['Fare'].quantile(1.00), color='black', linestyle='dotted', linewidth=1, label='Max (100th Percentile)')
plt.title('Distribution of Fare with Mean and Median')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.legend()
plt.show()
# Insights from the Fare distribution:
# - The distribution of the Fare column is right-skewed, with a long tail towards higher fares.
# - The mean fare is higher than the median fare, indicating that there are some passengers who paid significantly higher fares.
# - The majority of passengers paid fares below 100 British pounds, with a few passengers paying much higher fares.
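The mean-above-median signature of right skew can be checked numerically as well as visually. A minimal sketch on toy fare-like values (illustrative numbers, not the actual column):

```python
import pandas as pd

# Toy right-skewed sample: a few small values plus one extreme outlier
fares = pd.Series([7.25, 7.93, 8.05, 13.0, 26.0, 71.28, 512.33])
print(fares.skew() > 0)               # True: positive skew = long right tail
print(fares.mean() > fares.median())  # True: the tail pulls the mean above the median
```

Running `df['Fare'].skew()` on the real column gives the same qualitative answer: a positive skewness confirming the right-skewed shape seen in the histogram.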
In [55]:
# Let's find the passengers who paid a fare of zero
zero_fare_passengers = df[df['Fare'] == 0]
zero_fare_passengers
# Insights from the zero-fare passengers:
# - There are 15 passengers who paid a fare of 0.00 British pounds.
# These most likely represent crew members or employees of the shipping line because:
# 1. All are adult males
# 2. All embarked from Southampton (Embarked = S)
# 3. All have isAlone = 1 (indicating they traveled alone)
# Note: ages shown as 29.699118 are mean-imputed values from the earlier missing-value step, not recorded ages, so the "all adults" observation rests partly on imputed data.
Out[55]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 179 | 180 | 0 | 3 | Leonard, Mr. Lionel | male | 36.000000 | 0 | 0 | LINE | 0.0 | S | 1 | 1 | Mr |
| 263 | 264 | 0 | 1 | Harrison, Mr. William | male | 40.000000 | 0 | 0 | 112059 | 0.0 | S | 1 | 1 | Mr |
| 271 | 272 | 1 | 3 | Tornquist, Mr. William Henry | male | 25.000000 | 0 | 0 | LINE | 0.0 | S | 1 | 1 | Mr |
| 277 | 278 | 0 | 2 | Parkes, Mr. Francis "Frank" | male | 29.699118 | 0 | 0 | 239853 | 0.0 | S | 1 | 1 | Mr |
| 302 | 303 | 0 | 3 | Johnson, Mr. William Cahoone Jr | male | 19.000000 | 0 | 0 | LINE | 0.0 | S | 1 | 1 | Mr |
| 413 | 414 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | 29.699118 | 0 | 0 | 239853 | 0.0 | S | 1 | 1 | Mr |
| 466 | 467 | 0 | 2 | Campbell, Mr. William | male | 29.699118 | 0 | 0 | 239853 | 0.0 | S | 1 | 1 | Mr |
| 481 | 482 | 0 | 2 | Frost, Mr. Anthony Wood "Archie" | male | 29.699118 | 0 | 0 | 239854 | 0.0 | S | 1 | 1 | Mr |
| 597 | 598 | 0 | 3 | Johnson, Mr. Alfred | male | 49.000000 | 0 | 0 | LINE | 0.0 | S | 1 | 1 | Mr |
| 633 | 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | 29.699118 | 0 | 0 | 112052 | 0.0 | S | 1 | 1 | Mr |
| 674 | 675 | 0 | 2 | Watson, Mr. Ennis Hastings | male | 29.699118 | 0 | 0 | 239856 | 0.0 | S | 1 | 1 | Mr |
| 732 | 733 | 0 | 2 | Knight, Mr. Robert J | male | 29.699118 | 0 | 0 | 239855 | 0.0 | S | 1 | 1 | Mr |
| 806 | 807 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39.000000 | 0 | 0 | 112050 | 0.0 | S | 1 | 1 | Mr |
| 815 | 816 | 0 | 1 | Fry, Mr. Richard | male | 29.699118 | 0 | 0 | 112058 | 0.0 | S | 1 | 1 | Mr |
| 822 | 823 | 0 | 1 | Reuchlin, Jonkheer. John George | male | 38.000000 | 0 | 0 | 19972 | 0.0 | S | 1 | 1 | Jonkheer |
In [56]:
# Extract row where Title is "Capt"
captain_passengers = df[df['Title'] == 'Capt']
captain_passengers
# Insights from the captain passenger:
# - There is only one passenger with the title "Capt" (Captain).
# - The captain is a 70-year-old male.
# - He was probably traveling with his family, as his family size is 3.
# - The fare paid is not zero, so he paid for his passage (and likely his family's) like an ordinary passenger.
# But a suspicion needs to be considered here: is he the captain of the Titanic itself, the captain of some other ship, or a captain in the army? -> Needs more data to confirm this.
# The evidence in the data points toward a captain traveling as a passenger rather than the ship's commanding officer:
# 1. He is the only passenger with the title "Capt" in the dataset.
# 2. He paid a fare and was traveling with family, which a serving crew member would not.
# 3. He did not survive the Titanic disaster, despite being a senior citizen (70 years old).
Out[56]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 745 | 746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0 | S | 3 | 0 | Capt |
In [57]:
# Is my Title column perfect?
# Let's check the unique values in the Title column
print("Unique values in Title column:", df['Title'].unique())
print("Value counts for Title column:")
print(df['Title'].value_counts())
# Insights from the Title column:
# - The Title column contains various titles such as 'Mr', 'Mrs', 'Miss', 'Master', 'Dr', 'Rev', 'Col', 'Major', 'Mlle', 'Mme', 'Don', 'Sir', 'Lady', 'the Countess', and 'Jonkheer'.
# - Some titles appear only once or twice, such as 'Mlle', 'Mme', 'Don', 'Sir', 'Lady', and 'Jonkheer'.
# - Several titles can be grouped together, such as 'Mlle' and 'Mme' (French equivalents of Miss and Mrs, respectively) and 'Don' (a Spanish honorific equivalent to Mr).
# Let's group the titles into broader categories
title_mapping = {
    'Mr': 'Mr',
    'Mrs': 'Mrs',
    'Miss': 'Miss',
    'Master': 'Master',
    'Don': 'Mr',
    'Rev': 'Mr',
    'Dr': 'Mr',
    'Mme': 'Mrs',
    'Ms': 'Miss',
    'Major': 'Mr',
    'Lady': 'Mrs',
    'Sir': 'Mr',
    'Mlle': 'Miss',
    'Col': 'Mr',
    'Capt': 'Mr',
    'the Countess': 'Mrs',
    'Jonkheer': 'Mr'
}
# Apply the mapping to the Title column
df['Title'] = df['Title'].map(title_mapping)
# Check the unique values in the Title column after mapping
print("Unique values in Title column after mapping:", df['Title'].unique())
# Check the value counts for the Title column after mapping
print("Value counts for Title column after mapping:")
print(df['Title'].value_counts())
# Plotting the value counts for the Title column after mapping
df['Title'].value_counts().plot(kind='bar', title='Title')
plt.xticks(rotation=0)
plt.xlabel('Title')
plt.ylabel('Count')
# Plot text on top of the bars
for index, value in enumerate(df['Title'].value_counts()):
    plt.text(index, value + 5, str(value), ha='center', va='bottom')
plt.show()
# Insights from the Title column after mapping:
# - Only four broad categories remain: Mr (538), Miss (185), Mrs (128), and Master (40).
Unique values in Title column: ['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady' 'Sir' 'Mlle' 'Col' 'Capt' 'the Countess' 'Jonkheer']
Value counts for Title column:
Title
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: count, dtype: int64
Unique values in Title column after mapping: ['Mr' 'Mrs' 'Miss' 'Master']
Value counts for Title column after mapping:
Title
Mr        538
Miss      185
Mrs       128
Master     40
Name: count, dtype: int64
Logical Grouping and Justification:¶
| Rare Title | Mapped To | Justification |
|---|---|---|
| Mr | Mr | Already common |
| Mrs | Mrs | Already common |
| Miss | Miss | Already common |
| Master | Master | Already common |
| Don, Sir, Jonkheer | Mr | Noblemen or male honorifics |
| Dr, Rev, Col, Capt, Major | Mr | Male professionals or officers |
| Mme, the Countess, Lady | Mrs | Married women or titled nobility |
| Ms, Mlle | Miss | Variants or equivalents of Miss |
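One caveat worth knowing before relying on `Series.map`: any title missing from the mapping dict silently becomes NaN (here the mapping happens to cover all 17 titles, so nothing is lost). A small sketch with a hypothetical unmapped title shows the behavior and a `fillna` guard:

```python
import pandas as pd

# Hypothetical titles; "Countess" is deliberately absent from the mapping
titles = pd.Series(["Mr", "Dr", "Countess"])
mapping = {"Mr": "Mr", "Dr": "Mr"}

# Series.map returns NaN for any value missing from the dict
mapped = titles.map(mapping)
print(mapped.isna().sum())  # 1

# A safer variant keeps the original value when no mapping exists
safe = titles.map(mapping).fillna(titles)
print(safe.tolist())  # ['Mr', 'Mr', 'Countess']
```

Checking `df['Title'].isna().sum()` after the mapping step is a cheap guard against a title slipping through.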
In [58]:
df.Title.value_counts(normalize=True) * 100
Out[58]:
Title
Mr        60.381594
Miss      20.763187
Mrs       14.365881
Master     4.489338
Name: proportion, dtype: float64
In [59]:
df.head()
Out[59]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
In [60]:
# Let's find the correlation between the numerical columns in the dataset
# Correlation is a statistical measure that describes the strength and direction of a relationship between two variables.
# It ranges from -1 to 1, where:
# -1 indicates a perfect negative correlation (as one variable increases, the other decreases)
# 0 indicates no correlation (no relationship between the variables)
# 1 indicates a perfect positive correlation (as one variable increases, the other also increases)
# Correlation can be calculated using the Pearson correlation coefficient, which is the most common method for measuring correlation.
# In Pandas, we can use the corr() method to calculate the correlation between numerical columns in a DataFrame.
# Let's check the correlation between the numerical columns in the dataset
print("Correlation between numerical columns in the dataset:")
# Using numeric_only=True to avoid TypeError: unsupported operand type(s) for +: 'int' and 'str'
cor = df.corr(numeric_only=True)
# Display the correlation matrix
cor
Correlation between numerical columns in the dataset:
Out[60]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | FamilySize | isAlone |
|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 1.000000 | -0.005007 | -0.035144 | 0.033207 | -0.057527 | -0.001652 | 0.012658 | -0.040143 | 0.057462 |
| Survived | -0.005007 | 1.000000 | -0.338481 | -0.069809 | -0.035322 | 0.081629 | 0.257307 | 0.016639 | -0.203367 |
| Pclass | -0.035144 | -0.338481 | 1.000000 | -0.331339 | 0.083081 | 0.018443 | -0.549500 | 0.065997 | 0.135207 |
| Age | 0.033207 | -0.069809 | -0.331339 | 1.000000 | -0.232625 | -0.179191 | 0.091566 | -0.248512 | 0.179775 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.232625 | 1.000000 | 0.414838 | 0.159651 | 0.890712 | -0.584471 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.179191 | 0.414838 | 1.000000 | 0.216225 | 0.783111 | -0.583398 |
| Fare | 0.012658 | 0.257307 | -0.549500 | 0.091566 | 0.159651 | 0.216225 | 1.000000 | 0.217138 | -0.271832 |
| FamilySize | -0.040143 | 0.016639 | 0.065997 | -0.248512 | 0.890712 | 0.783111 | 0.217138 | 1.000000 | -0.690922 |
| isAlone | 0.057462 | -0.203367 | 0.135207 | 0.179775 | -0.584471 | -0.583398 | -0.271832 | -0.690922 | 1.000000 |
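The Pearson definition above can be sanity-checked on toy data where the relationships are perfect by construction (illustrative values only, not the Titanic frame):

```python
import pandas as pd

# Toy data: y is a perfect linear function of x; z moves exactly opposite to x
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
cor_toy = toy.corr()  # pairwise Pearson correlations, each in [-1, 1]
print(round(cor_toy.loc["x", "y"], 6))  # 1.0  (perfect positive)
print(round(cor_toy.loc["x", "z"], 6))  # -1.0 (perfect negative)
```

Real-world columns such as Fare and Pclass fall between these extremes, which is why the matrix above is full of intermediate values.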
In [61]:
# Plotting the correlation matrix using seaborn heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(cor, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Columns')
plt.show()
In [62]:
# Insights from the correlation matrix:
# - 'Survived' has a negative correlation with 'Pclass' (-0.34): as the class number increases (1st -> 3rd), survival decreases, so passengers in higher classes (1st class) had a higher survival rate.
# - 'Survived' has a positive correlation with 'Fare' (0.26), indicating that passengers who paid higher fares had a higher survival rate.
# - 'Age' has a weak positive correlation with 'Fare' (0.09), indicating that older passengers tended to pay slightly higher fares.
# - 'SibSp' and 'Parch' have a moderate positive correlation (0.41), indicating that passengers with more siblings/spouses aboard tended to also have more parents/children aboard.
# - 'FamilySize' is strongly correlated with 'SibSp' (0.89) and 'Parch' (0.78), as expected, since it is derived directly from them.
# - 'FamilySize' has a negligible correlation with 'Survived' (0.02) on its own.
# - 'isAlone' has a negative correlation with 'Survived' (-0.20), indicating that passengers who were alone had a lower survival rate than those who traveled with family.
# - The 'Title' column is categorical, so it is excluded from the numeric correlation matrix, but it can be analyzed separately.
In [63]:
df.head()
Out[63]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
In [64]:
# Let's understand the Survived column in detail
df.Survived.value_counts().plot(kind='bar', title='Survived')
plt.xticks(rotation=0)
plt.xlabel('Survived')
plt.ylabel('Count')
# Plot text on top of the bars
for index, value in enumerate(df.Survived.value_counts()):
    plt.text(index, value + 5, str(value), ha='center', va='bottom')
plt.show()
# Insights from the Survived column:
# - The majority of passengers did not survive (0 = No, 1 = Yes): 549 perished versus 342 survivors, a survival rate of approximately 38.4%.
In [65]:
df.head()
Out[65]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
In [66]:
df.groupby('Pclass')['Survived'].mean()
Out[66]:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
In [67]:
df.groupby('Pclass')[['Survived']].mean()
Out[67]:
| Pclass | Survived |
|---|---|
| 1 | 0.629630 |
| 2 | 0.472826 |
| 3 | 0.242363 |
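Because Survived is a 0/1 indicator, the groupby mean is exactly the per-group survival rate, and `pd.crosstab(..., normalize='index')` yields the same proportions from the counts. A minimal sketch with toy data (not the Titanic frame):

```python
import pandas as pd

# Toy data: the mean of a 0/1 indicator per group IS that group's rate
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})
rates = toy.groupby("Pclass")["Survived"].mean()
print(rates.loc[1], rates.loc[2])  # 1.0 0.0

# crosstab normalized per row gives the same per-class proportions
ct = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(ct.loc[3, 1])  # ~0.333 (1 of the 3 class-3 rows survived)
```

The crosstab form is handy when you also want the non-survivor proportion in the same table.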
In [68]:
df.groupby('Pclass')[['Survived']].mean().index
Out[68]:
Index([1, 2, 3], dtype='int64', name='Pclass')
In [69]:
df.groupby('Pclass')[['Survived']].mean().values
Out[69]:
array([[0.62962963],
[0.47282609],
[0.24236253]])
In [70]:
# How is Pclass related to Survived?
print(df.groupby('Pclass')['Survived'].mean())
# Insights:
# - The survival rate is highest for passengers in 1st class (Pclass = 1) at approximately 62.96%.
# - The survival rate is lower for passengers in 2nd class (Pclass = 2) at approximately 47.28%.
# - The survival rate is lowest for passengers in 3rd class (Pclass = 3) at approximately 24.24%.
# This indicates that passengers in higher classes had a significantly higher chance of survival compared to those in lower classes.
df.groupby('Pclass')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by Passenger Class (Pclass)')
plt.xlabel('Passenger Class (Pclass)')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
# Plot text on top of the bars
for index, value in enumerate(df.groupby('Pclass')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
In [71]:
# Visualizing the relationship between Pclass and Survived
sns.countplot(x='Pclass', hue='Survived', data=df)
plt.title('Survival Count by Passenger Class (Pclass)')
plt.xlabel('Passenger Class (Pclass)')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
In [72]:
df.head()
Out[72]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
In [73]:
# How is Sex related to Survived?
print(df.groupby('Sex')['Survived'].mean())
# Insights:
# - The survival rate is higher for females (Sex = female) at approximately 74.2%.
# - The survival rate is lower for males (Sex = male) at approximately 18.9%.
# This indicates that females had a significantly higher chance of survival compared to males.
df.groupby('Sex')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by Sex')
plt.xlabel('Sex')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
# Plot text on top of the bars
for index, value in enumerate(df.groupby('Sex')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
In [74]:
# Visualizing the relationship between Sex and Survived
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title('Survival Count by Sex')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
In [75]:
df.head()
Out[75]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
In [76]:
# How is Embarked related to Survived?
print(df.groupby('Embarked')['Survived'].mean())
# Insights:
# - The survival rate is highest for passengers who embarked at Cherbourg (Embarked = C) at approximately 55.4%.
# - The survival rate for passengers who embarked at Queenstown (Embarked = Q) is approximately 39.0%.
# - The survival rate is lowest for passengers who embarked at Southampton (Embarked = S) at approximately 33.9%.
# This indicates that the place of embarkation had an impact on the survival rate.
df.groupby('Embarked')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by Embarked')
plt.xlabel('Embarked')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
# Plot text on top of the bars
for index, value in enumerate(df.groupby('Embarked')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
Embarked
C    0.553571
Q    0.389610
S    0.339009
Name: Survived, dtype: float64
In [77]:
# Visualizing the relationship between Embarked and Survived
sns.countplot(x='Embarked', hue='Survived', data=df)
plt.title('Survival Count by Embarked')
plt.xlabel('Embarked')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
In [78]:
df.head()
Out[78]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
In [79]:
# How is isAlone related to Survived?
print(df.groupby('isAlone')['Survived'].mean())
# Insights:
# - The survival rate is higher for passengers who were not alone (isAlone = 0) at approximately 50.6%.
# - The survival rate is lower for passengers who were alone (isAlone = 1) at approximately 30.4%.
# This indicates that passengers who traveled with family had a higher chance of survival compared to those who were alone.
df.groupby('isAlone')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by isAlone')
plt.xlabel('isAlone')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
# Plot text on top of the bars
for index, value in enumerate(df.groupby('isAlone')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
isAlone
0    0.505650
1    0.303538
Name: Survived, dtype: float64
In [80]:
# Visualizing the relationship between isAlone and Survived
sns.countplot(x='isAlone', hue='Survived', data=df)
plt.title('Survival Count by isAlone')
plt.xlabel('isAlone')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
In [81]:
df.head()
Out[81]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
In [82]:
# How is Title related to Survived?
print(df.groupby('Title')['Survived'].mean())
# Insights:
# - The survival rate is highest for Mrs at approximately 79.7%, followed by Miss (~70.3%) and Master (~57.5%).
# - The survival rate is lowest for Mr at approximately 16.2%.
df.groupby('Title')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by Title')
plt.xlabel('Title')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
# Plot text on top of the bars
for index, value in enumerate(df.groupby('Title')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
Title
Master    0.575000
Miss      0.702703
Mr        0.161710
Mrs       0.796875
Name: Survived, dtype: float64
In [83]:
# Visualizing the relationship between Title and Survived
sns.countplot(x='Title', hue='Survived', data=df)
plt.title('Survival Count by Title')
plt.xlabel('Title')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
In [84]:
df.head()
Out[84]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
In [85]:
# Feature Engineering:
# - Create a new feature "GenderClass":
# If Age < 15, GenderClass = "child"
# Else GenderClass = "male" or "female" as per "Sex" column
# Note: with apply(..., axis=1), the function receives one row at a time, not the whole DataFrame
def create_gender_class(row):
    if row['Age'] < 15:
        return "child"
    else:
        return row['Sex']
df['GenderClass'] = df.apply(create_gender_class, axis=1)
df["GenderClass"].value_counts().plot(kind='bar', title='GenderClass')
plt.xticks(rotation=0)
plt.xlabel('GenderClass')
plt.ylabel('Count')
# Plot text on top of the bars
for index, value in enumerate(df["GenderClass"].value_counts()):
    plt.text(index, value, f"{value}", ha='center', va='bottom')
plt.show()
In [ ]:
df["GenderClass"].value_counts()
# Insights from the GenderClass feature:
# - There are 78 children (Age < 15) among the 891 passengers in this dataset.
Out[ ]:
GenderClass male 538 female 275 child 78 Name: count, dtype: int64
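The row-wise `apply()` above works, but the same feature can be built with a vectorized `np.where`, which is typically much faster on large frames. A minimal sketch on a small synthetic frame (the ages and sexes below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the Titanic df (values are illustrative)
sample = pd.DataFrame({
    'Age': [22.0, 8.0, 35.0, 14.0],
    'Sex': ['male', 'female', 'female', 'male'],
})

# np.where evaluates the condition on the whole column at once,
# avoiding a Python-level call per row
sample['GenderClass'] = np.where(sample['Age'] < 15, 'child', sample['Sex'])
print(sample['GenderClass'].tolist())  # ['male', 'child', 'female', 'child']
```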
In [87]:
df.head()
Out[87]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title | GenderClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr | male |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs | female |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss | female |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs | female |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr | male |
In [ ]:
# How is GenderClass related to Survived?
print(df.groupby('GenderClass')['Survived'].mean())
# Insights:
# - Females (Age >= 15) had the highest survival rate at ~76%, followed by children at ~58%.
# - Males (Age >= 15) had the lowest survival rate at ~16%.
df.groupby('GenderClass')[['Survived']].mean().plot(kind='bar')
plt.title('Average Survival Rate by GenderClass')
plt.xlabel('GenderClass')
plt.ylabel('Average Survival Rate')
plt.xticks(rotation=0)
# Plot text on top of the bars
for index, value in enumerate(df.groupby('GenderClass')['Survived'].mean()):
    plt.text(index, value, f"{value:.2f}", ha='center', va='bottom')
plt.show()
GenderClass child 0.576923 female 0.760000 male 0.163569 Name: Survived, dtype: float64
In [89]:
# Visualizing the relationship between GenderClass and Survived
sns.countplot(x='GenderClass', hue='Survived', data=df)
plt.title('Survival Count by GenderClass')
plt.xlabel('GenderClass')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
# Plot text on top of the bars
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')
plt.show()
In [90]:
df.head()
Out[90]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title | GenderClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr | male |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs | female |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss | female |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs | female |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr | male |
In [91]:
# Drop "GenderClass" feature
df = df.drop(columns=['GenderClass'])
# or
# df.drop(['GenderClass'], inplace=True, axis=1)
df.head()
Out[91]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr |
In [93]:
# Create "GenderClass" column again with a different approach - lambda function
df['GenderClass'] = df.apply(lambda row: 'child' if row['Age'] < 15 else row['Sex'], axis=1)
df["GenderClass"].value_counts()
Out[93]:
GenderClass male 538 female 275 child 78 Name: count, dtype: int64
In [94]:
df.head()
Out[94]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title | GenderClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr | male |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs | female |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss | female |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs | female |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr | male |
In [95]:
# Binning Age into categories
# Logic:
# 1. If Age < 15, AgeGroup = "child"
# 2. If Age >= 15 and Age < 30, AgeGroup = "young"
# 3. If Age >= 30 and Age < 50, AgeGroup = "middle-aged"
# 4. If Age >= 50, AgeGroup = "senior"
def create_age_group(row):
    if row['Age'] < 15:
        return 'child'
    elif row['Age'] < 30:  # the earlier branches already ruled out Age < 15
        return 'young'
    elif row['Age'] < 50:
        return 'middle-aged'
    else:
        return 'senior'
df['AgeGroup'] = df.apply(create_age_group, axis=1)
df.head()
Out[95]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title | GenderClass | AgeGroup | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr | male | young |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs | female | middle-aged |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss | female | young |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs | female | middle-aged |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr | male | middle-aged |
In [ ]:
df["AgeGroup"].value_counts()
# Insights from the AgeGroup feature:
# - Most passengers were young (15-29), followed by middle-aged (30-49), then children (< 15), then seniors (50+).
Out[ ]:
AgeGroup young 483 middle-aged 256 child 78 senior 74 Name: count, dtype: int64
In [97]:
# Drop "AgeGroup" feature
df = df.drop(columns=['AgeGroup'])
# or
# df.drop(['AgeGroup'], inplace=True, axis=1)
df.head()
Out[97]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title | GenderClass | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr | male |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs | female |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss | female |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs | female |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr | male |
In [ ]:
# Binning Age into categories using a different approach - pd.cut()
# Logic:
# 1. If Age < 15, AgeGroup = "child"
# 2. If Age >= 15 and Age < 30, AgeGroup = "young"
# 3. If Age >= 30 and Age < 50, AgeGroup = "middle-aged"
# 4. If Age >= 50, AgeGroup = "senior"
bins = [0, 15, 30, 50, 100]
labels = ['child', 'young', 'middle-aged', 'senior']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
# right=False: each bin includes its left edge and excludes its right edge, i.e. [0, 15), [15, 30), [30, 50), [50, 100)
df.head()
Out[ ]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title | GenderClass | AgeGroup | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr | male | young |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs | female | middle-aged |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss | female | young |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs | female | middle-aged |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr | male | middle-aged |
In [99]:
df["AgeGroup"].value_counts()
Out[99]:
AgeGroup young 483 middle-aged 256 child 78 senior 74 Name: count, dtype: int64
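The interval-edge behaviour of `right=False` is easy to verify on a few boundary ages. A small sketch using the same bins and labels as above:

```python
import pandas as pd

bins = [0, 15, 30, 50, 100]
labels = ['child', 'young', 'middle-aged', 'senior']
ages = pd.Series([14.9, 15.0, 29.9, 30.0, 50.0])  # edge cases around the bin boundaries

# right=False -> intervals are [0, 15), [15, 30), [30, 50), [50, 100),
# so an age exactly on a boundary falls into the bin to its right
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(groups.tolist())  # ['child', 'young', 'young', 'middle-aged', 'senior']
```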
In [100]:
df.head()
Out[100]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title | GenderClass | AgeGroup | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr | male | young |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs | female | middle-aged |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss | female | young |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs | female | middle-aged |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr | male | middle-aged |
In [104]:
# Question (Ramani): for the women who survived, did their children also survive?
# Keep all passengers except those with Title == "Mr" (i.e., filter out "Mr")
except_Mr_passengers = df[df['Title'] != 'Mr']
except_Mr_passengers.head()
Out[104]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title | GenderClass | AgeGroup | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs | female | middle-aged |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss | female | young |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs | female | middle-aged |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | S | 5 | 0 | Master | child | child |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | S | 3 | 0 | Mrs | female | young |
In [105]:
except_Mr_passengers.shape # (number of non-"Mr" passengers, number of columns) - note this includes Master, i.e. male children
Out[105]:
(353, 16)
In [106]:
except_Mr_passengers.Title.value_counts()
Out[106]:
Title Miss 185 Mrs 128 Master 40 Name: count, dtype: int64
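One possible way to tackle the question above (a sketch, not part of the original analysis) is to extract the surname from `Name` and pair each Mrs with the Master/Miss passengers sharing that surname. The caveat: "Miss" also covers unmarried adult women, so surname matching is only approximate. The frame below is synthetic; the column names mirror the Titanic dataset:

```python
import pandas as pd

# Synthetic stand-in for the Titanic df (names and outcomes are illustrative)
toy = pd.DataFrame({
    'Name': ['Palsson, Mrs. Nils', 'Palsson, Master. Gosta',
             'Johnson, Mrs. Oscar', 'Johnson, Miss. Eleanor'],
    'Title': ['Mrs', 'Master', 'Mrs', 'Miss'],
    'Survived': [0, 0, 1, 1],
})

# Surname is everything before the first comma in Name
toy['Surname'] = toy['Name'].str.split(',').str[0]

mothers = toy[toy['Title'] == 'Mrs'][['Surname', 'Survived']]
children = toy[toy['Title'].isin(['Master', 'Miss'])][['Surname', 'Survived']]

# Pair mothers with same-surname children; compare their outcomes side by side
pairs = mothers.merge(children, on='Surname', suffixes=('_mother', '_child'))
print(pairs)
```

On the real dataset, a crosstab of `Survived_mother` vs `Survived_child` from such pairs would quantify how strongly the two outcomes move together.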
In [108]:
df.head()
Out[108]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | FamilySize | isAlone | Title | GenderClass | AgeGroup | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 2 | 0 | Mr | male | young |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 2 | 0 | Mrs | female | middle-aged |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 | Miss | female | young |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 2 | 0 | Mrs | female | middle-aged |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 1 | 1 | Mr | male | middle-aged |
In [109]:
# ML Problem Statement:
# Aim: To predict the survival of passengers based on their features.
# Drop identifier-like columns (PassengerId, Name, Ticket) and columns whose
# information is already captured by engineered features (Sex -> GenderClass, SibSp/Parch -> FamilySize)
columns_to_drop = ['PassengerId', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket']
df.drop(columns=columns_to_drop, inplace=True)
# or
# df = df.drop(columns=columns_to_drop)
# Display the first few rows of the cleaned DataFrame
df.head()
Out[109]:
| Survived | Pclass | Age | Fare | Embarked | FamilySize | isAlone | Title | GenderClass | AgeGroup | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 7.2500 | S | 2 | 0 | Mr | male | young |
| 1 | 1 | 1 | 38.0 | 71.2833 | C | 2 | 0 | Mrs | female | middle-aged |
| 2 | 1 | 3 | 26.0 | 7.9250 | S | 1 | 1 | Miss | female | young |
| 3 | 1 | 1 | 35.0 | 53.1000 | S | 2 | 0 | Mrs | female | middle-aged |
| 4 | 0 | 3 | 35.0 | 8.0500 | S | 1 | 1 | Mr | male | middle-aged |
In [110]:
columns_to_dummify = ["Embarked", "Title", "GenderClass", "AgeGroup"]
# Creating dummy variables for categorical columns
df = pd.get_dummies(df, columns=columns_to_dummify, drop_first=True, dtype=int)
# drop_first=True drops the first category of each categorical column to avoid multicollinearity (the dummy variable trap)
# Display the first few rows of the DataFrame after creating dummy variables
df.head()
Out[110]:
| Survived | Pclass | Age | Fare | FamilySize | isAlone | Embarked_Q | Embarked_S | Title_Miss | Title_Mr | Title_Mrs | GenderClass_female | GenderClass_male | AgeGroup_young | AgeGroup_middle-aged | AgeGroup_senior | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 7.2500 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 38.0 | 71.2833 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 3 | 26.0 | 7.9250 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 35.0 | 53.1000 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 3 | 35.0 | 8.0500 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
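What `drop_first=True` does can be seen on a tiny frame with a single categorical column (values illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# The first category in sorted order ('C') gets no column of its own;
# it is encoded implicitly as Embarked_Q == 0 and Embarked_S == 0
dummies = pd.get_dummies(toy, columns=['Embarked'], drop_first=True, dtype=int)
print(dummies.columns.tolist())      # ['Embarked_Q', 'Embarked_S']
print(dummies.to_numpy().tolist())   # [[0, 1], [0, 0], [1, 0], [0, 1]]
```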
In [111]:
numerical_cols = ['Age', 'Fare']
# Scale numerical columns - StandardScaler
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
df[numerical_cols] = ss.fit_transform(df[numerical_cols])
# Display the first few rows of the DataFrame after scaling numerical columns
df.head()
Out[111]:
| Survived | Pclass | Age | Fare | FamilySize | isAlone | Embarked_Q | Embarked_S | Title_Miss | Title_Mr | Title_Mrs | GenderClass_female | GenderClass_male | AgeGroup_young | AgeGroup_middle-aged | AgeGroup_senior | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | -0.592481 | -0.502445 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0.638789 | 0.786845 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 3 | -0.284663 | -0.488854 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 0.407926 | 0.420730 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 3 | 0.407926 | -0.486337 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
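As a sanity check, `StandardScaler` is equivalent to subtracting the mean and dividing by the population standard deviation (ddof=0). A minimal sketch on a few illustrative fare values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few fare-like values (illustrative); StandardScaler expects a 2-D array
x = np.array([[7.25], [71.2833], [7.925], [53.1], [8.05]])

scaled = StandardScaler().fit_transform(x)
manual = (x - x.mean()) / x.std()  # np.std defaults to ddof=0, matching sklearn

assert np.allclose(scaled, manual)
```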
In [112]:
df.describe()
Out[112]:
| Survived | Pclass | Age | Fare | FamilySize | isAlone | Embarked_Q | Embarked_S | Title_Miss | Title_Mr | Title_Mrs | GenderClass_female | GenderClass_male | AgeGroup_young | AgeGroup_middle-aged | AgeGroup_senior | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 8.910000e+02 | 8.910000e+02 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 2.232906e-16 | 3.987333e-18 | 1.904602 | 0.602694 | 0.086420 | 0.725028 | 0.207632 | 0.603816 | 0.143659 | 0.308642 | 0.603816 | 0.542088 | 0.287318 | 0.083053 |
| std | 0.486592 | 0.836071 | 1.000562e+00 | 1.000562e+00 | 1.613459 | 0.489615 | 0.281141 | 0.446751 | 0.405840 | 0.489378 | 0.350940 | 0.462192 | 0.489378 | 0.498505 | 0.452765 | 0.276117 |
| min | 0.000000 | 1.000000 | -2.253155e+00 | -6.484217e-01 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.000000 | -5.924806e-01 | -4.891482e-01 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 3.000000 | 0.000000e+00 | -3.573909e-01 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 3.000000 | 4.079260e-01 | -2.424635e-02 | 2.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
| max | 1.000000 | 3.000000 | 3.870872e+00 | 9.667167e+00 | 11.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
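Note that `describe()` reports a std of about 1.000562 (not exactly 1) for the scaled Age and Fare columns. This is because `StandardScaler` divides by the population std (ddof=0) while `describe()` reports the sample std (ddof=1); the ratio between the two is sqrt(n / (n - 1)):

```python
import numpy as np

# With n = 891 rows, sample std / population std = sqrt(n / (n - 1)),
# which matches the 1.000562 shown by describe() for the scaled columns
n = 891
expected_std = np.sqrt(n / (n - 1))
print(round(expected_std, 6))  # 1.000562
```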
In [114]:
# Plot the distribution of the scaled numerical columns
# Plotting the distribution of the scaled Age column
sns.histplot(df['Age'], kde=True)
plt.axvline(df['Age'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(df['Age'].median(), color='green', linestyle='dotted', linewidth=1, label='Median')
plt.title('Distribution of Scaled Age with Mean and Median')
plt.xlabel('Scaled Age')
plt.ylabel('Frequency')
plt.legend()
plt.show()
In [115]:
# Plotting the distribution of the scaled Fare column
sns.histplot(df['Fare'], kde=True)
plt.axvline(df['Fare'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(df['Fare'].median(), color='green', linestyle='dotted', linewidth=1, label='Median')
plt.title('Distribution of Scaled Fare with Mean and Median')
plt.xlabel('Scaled Fare')
plt.ylabel('Frequency')
plt.legend()
plt.show()